<%
=begin
apps: etcd
platforms: kubernetes, tanzu-application-catalog
id: understand_default_configuration
title: Understand how the etcd cluster is configured
category: get-started
weight: 40
highlight: 40
=end %>

### Bootstrapping and discovery

The <%= variable :catalog_name, :platform %> etcd chart uses static discovery configured via environment variables to bootstrap the etcd cluster. Based on the number of initial replicas, and using the A records added to the DNS configuration by the headless service, the chart can calculate every advertised peer URL.

The chart makes use of some extra elements offered by Kubernetes to ensure the bootstrapping is successful:

* It sets a "Parallel" Pod Management Policy. This is critical, since all the etcd replicas must be created simultaneously to guarantee they can find each other.
* It registers "not ready" pods in DNS, so etcd replicas are reachable through their associated FQDNs before they are actually ready. Both settings are shown in the manifest sketch below.

Learn more about [etcd discovery](https://etcd.io/docs/current/op-guide/clustering/#discovery), [Pod Management Policies](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#pod-management-policies) and [recording "not ready" pods](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields).
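
For reference, here is a minimal sketch of how these two settings are expressed in Kubernetes manifests. The resource names, labels and image are illustrative values matching this guide's examples, not the chart's exact templates:

~~~
# Headless service: clusterIP None plus publishNotReadyAddresses
# makes DNS A records available for pods before they become ready.
apiVersion: v1
kind: Service
metadata:
  name: etcd-headless
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: etcd
  ports:
    - name: client
      port: 2379
    - name: peer
      port: 2380
---
# StatefulSet: the "Parallel" policy creates all replicas at once,
# so the members can find each other while bootstrapping.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd-headless
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: etcd   # illustrative image reference
          ports:
            - containerPort: 2380
              name: peer
~~~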

Here is an example of the environment configuration bootstrapping an etcd cluster with 3 replicas:

| Member | Variable                         | Value                                                      |
|--------|----------------------------------|------------------------------------------------------------|
| 0      | ETCD_NAME                        | etcd-0                                                     |
| 0      | ETCD_INITIAL_ADVERTISE_PEER_URLS | http://etcd-0.etcd-headless.default.svc.cluster.local:2380 |
| 1      | ETCD_NAME                        | etcd-1                                                     |
| 1      | ETCD_INITIAL_ADVERTISE_PEER_URLS | http://etcd-1.etcd-headless.default.svc.cluster.local:2380 |
| 2      | ETCD_NAME                        | etcd-2                                                     |
| 2      | ETCD_INITIAL_ADVERTISE_PEER_URLS | http://etcd-2.etcd-headless.default.svc.cluster.local:2380 |
| *      | ETCD_INITIAL_CLUSTER_STATE       | new                                                        |
| *      | ETCD_INITIAL_CLUSTER_TOKEN       | etcd-cluster-k8s                                           |
| *      | ETCD_INITIAL_CLUSTER             | etcd-0=http://etcd-0.etcd-headless.default.svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.default.svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.default.svc.cluster.local:2380 |
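
To verify these values on a running cluster, you can inspect the environment of one of the pods (the pod and namespace names match the illustrative deployment above):

~~~
kubectl exec --namespace default etcd-0 -- env | grep ^ETCD_
~~~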

The readiness and liveness probes are delayed 60 seconds by default, giving the etcd replicas time to start and find each other. After that period, the *etcdctl endpoint health* command is used to periodically check the health of every replica.
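
For illustration, such a probe on the etcd container could look like the sketch below; the exact commands, flags and timings are chart defaults that may differ between versions:

~~~
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - etcdctl endpoint health   # non-zero exit marks the member unhealthy
  initialDelaySeconds: 60        # leave time for bootstrapping
  periodSeconds: 30              # illustrative check interval
~~~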

### Scalability

The <%= variable :catalog_name, :platform %> etcd chart uses etcd reconfiguration operations to add/remove members of the cluster during scaling.

When scaling down, a "pre-stop" lifecycle hook is used to ensure that the *etcdctl member remove* command is executed. The hook stores the output of this command in the persistent volume attached to the etcd pod. This hook is also executed when the pod is manually removed using the *kubectl delete pod* command or rescheduled by Kubernetes for any reason. This implies that the cluster can be scaled up/down without human intervention.
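
Conceptually, the hook resembles the sketch below. This is not the chart's actual script: the member-ID lookup and the log path on the persistent volume are illustrative assumptions:

~~~
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Find this member's ID by matching its name, remove it from
          # the cluster, and keep the command output on the persistent
          # volume so the init logic can check whether removal succeeded.
          MEMBER_ID="$(etcdctl member list | grep "${ETCD_NAME}" | cut -d, -f1)"
          etcdctl member remove "${MEMBER_ID}" > /etcd-data/member_removal.log 2>&1
~~~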

Here is an example to explain how this works:

1. An etcd cluster with three members running on a three-node Kubernetes cluster is bootstrapped.
1. After a few days, the cluster administrator decides to upgrade the kernel on one of the cluster nodes. To do so, the administrator drains the node (see the command example after this list). Pods running on that node are rescheduled to a different one.
1. During the pod eviction process, the "pre-stop" hook removes the etcd member from the cluster. Thus, the etcd cluster is scaled down to only two members.
1. Once the pod is scheduled on another node and initialized, the etcd member is added back to the cluster using the *etcdctl member add* command. Thus, the etcd cluster is scaled up to three members.
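
For instance, the drain in step 2 could be performed as follows (the node name is illustrative):

~~~
# Drain the node; the pod eviction triggers the "pre-stop" hook
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# While the pod is being rescheduled, the member list shows two members
etcdctl member list
~~~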

If, for whatever reason, the "pre-stop" hook fails to remove the member, the initialization logic is able to detect that something went wrong by checking the *etcdctl member remove* command output that was stored in the persistent volume. It then uses the *etcdctl member update* command to add the member back to the cluster. In this case, the cluster isn't automatically scaled down/up while the pod recovers. Therefore, when other members attempt to connect to the pod, they may log warnings or errors like the ones below:

~~~
E | rafthttp: failed to dial XXXXXXXX on stream Message (peer XXXXXXXX failed to find local node YYYYYYYYY)
I | rafthttp: peer XXXXXXXX became inactive (message send to peer failed)
W | rafthttp: health check for peer XXXXXXXX could not connect: dial tcp A.B.C.D:2380: i/o timeout
~~~
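
Because the member was never removed, it keeps its previous member ID, so the recovery only needs to refresh its peer URL. For illustration, the underlying operation resembles the following (the member ID and peer URL are placeholders):

~~~
etcdctl member update <member-id> \
  --peer-urls=http://etcd-1.etcd-headless.default.svc.cluster.local:2380
~~~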

Learn more about [etcd runtime configuration](https://etcd.io/docs/current/op-guide/runtime-configuration/) and [how to safely drain a Kubernetes node](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/).

### Cluster updates

When updating the etcd StatefulSet (such as when upgrading the chart version via the *helm upgrade* command), every pod must be replaced following the StatefulSet update strategy.

The chart uses a "RollingUpdate" strategy by default, with the default Kubernetes values. In other words, it updates each pod one at a time, in the same order as pod termination (from the largest ordinal to the smallest), and waits until an updated pod is "Running" and "Ready" before updating its predecessor.
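
In StatefulSet terms, this corresponds to the default value of the *updateStrategy* field:

~~~
spec:
  updateStrategy:
    type: RollingUpdate   # replace pods one at a time, highest ordinal first
~~~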

Learn more about [StatefulSet update strategies](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies).

### Disaster recovery

If, for whatever reason, more than (N-1)/2 members of the cluster fail and the "pre-stop" hooks also fail at removing them from the cluster, the cluster disastrously fails, irrevocably losing quorum. Once quorum is lost, the cluster cannot reach consensus and therefore cannot continue accepting updates. Under this circumstance, the only possible solution is usually to restore the cluster from a snapshot.

> IMPORTANT: All members should restore using the same snapshot.

The <%= variable :catalog_name, :platform %> etcd chart solves this problem by optionally offering a Kubernetes cron job that periodically snapshots the keyspace and stores it in a RWX volume. In case the cluster disastrously fails, the pods will automatically try to restore it using the last available snapshot.
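
The snapshot itself relies on the standard *etcdctl snapshot save* command. A simplified sketch of such a cron job follows; the schedule, image, endpoint and claim name are illustrative, and the claim must point to a RWX volume:

~~~
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshotter
spec:
  schedule: "*/30 * * * *"   # illustrative: snapshot every 30 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: snapshotter
              image: etcd   # illustrative image reference
              command:
                - /bin/sh
                - -c
                - etcdctl --endpoints=http://etcd-0.etcd-headless.default.svc.cluster.local:2379 snapshot save /snapshots/db
              volumeMounts:
                - name: snapshots
                  mountPath: /snapshots
          volumes:
            - name: snapshots
              persistentVolumeClaim:
                claimName: etcd-snapshots   # RWX volume shared with the etcd pods
~~~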

[Learn how to enable this disaster recovery feature](../../administration/enable-disaster-recovery).

By default, the chart also sets a "soft" pod anti-affinity to reduce the risk of the cluster failing disastrously.
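
A "soft" anti-affinity is expressed as a scheduling preference rather than a requirement, so the scheduler spreads replicas across nodes when possible but can still schedule them otherwise. The labels below match the illustrative manifests above:

~~~
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname   # prefer spreading across nodes
          labelSelector:
            matchLabels:
              app: etcd
~~~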

Learn more about [etcd recovery](https://etcd.io/docs/current/op-guide/recovery), [Kubernetes cron jobs](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/) and [pod affinity and anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity).
